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This Talk 



© Optimizing physics simulation on a multi-core 
architecture. 

® Focus on CELL architecture 

© Variety of simulation domains 

® Cloth, Rigid Bodies, Fluids, Particles 

© Practical advice based on real case-studies 
© Demos! 
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Basic Issues 



Looking for opportunities to parallelize processing 

® High Level - Many independent solvers on multiple cores 
® Low Level - One solver, one/multiple cores 

Coding with small memory in mind 
® Streaming
® Batching up work 
® Software Caching 

Speeding up processing within each unit 

® SIMD processing, instruction scheduling 
® Double-buffering 

Parallelizing/optimizing existing code 
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What is not in this talk? 



Details on specific physics algorithms 

® Too much material for a 1-hour talk 
® Will provide references to techniques 

Much insight on non-CELL platforms 
® Concentrate on actual results 
® Concepts should be applicable beyond CELL 
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The Cell Processor Model 



[Diagram: main memory, and the PPU with its L1/L2 cache.]
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Physics on CELL 


[Diagram: SPU0 to SPU7, each with a 256K LS and a DMA engine, connected to main memory.]


Physics should happen mostly on SPUs

@ There are more of them!
@ SPUs have greater bandwidth & performance
@ PPU is busy doing other stuff


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SPU Performance Recipe 

® Large bandwidth to and from main memory 
® Quick (1-cycle) LS memory access 
® SIMD instruction set 
® Concurrent DMA and processing 
® Challenges: 

@ Limited LS size, shared between code and data 
@ Random accesses of main memory are slow 
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Cloth Simulation 
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Cloth Simulation 




Cloth mesh simulated as point masses 
(vertices) connected via distance constraints 
(edges). 


[Figure: mesh triangle]

® References:

® T. Jacobsen, Advanced Character Physics, GDC 2001
® A. Meggs, Taking Real-Time Cloth Beyond Curtains, GDC 2005
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Simulation Step 




1. Compute external forces, $f^{t}$, per vertex

2. Compute new vertex positions [Integration]


3. Fix edge lengths 

® Adjust vertex positions 

4. Correct penetrations with collision geometry 

® Adjust vertex positions 


GameDevelopers 

Conference 


How many vertices? 



© How many vertices fit in 256K (less actually)? 

® A lot, surprisingly... 

© Tips: 

® Look for opportunities to stream data 
® Keep in LS only data required for each step 
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Integration Step 




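$p^{t+1} = 2p^{t} - p^{t-1} + \frac{f^{t}}{m}\,\Delta t^{2}$

16 + 16 + 16 + 4 = 52 bytes / vertex
(three 16-byte vectors $p^{t}$, $p^{t-1}$, $f^{t}$, plus a 4-byte mass)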


© Fewer than 4000 verts in 200K of memory

© We don’t need to keep them all in LS

© Keep vertex data in main memory and bring it
in block by block
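
The per-vertex arithmetic above can be checked with a small sketch. It assumes the Verlet update from the Jacobsen reference and a hypothetical record layout (the names `Vec4`, `ClothVertex`, `maxVertices`, and `integrate` are illustrative, not from the talk): three 16-byte vectors plus a 4-byte mass give 52 bytes, so a 200K budget holds 200 * 1024 / 52 = 3938 whole vertices.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-vertex record; field names are illustrative.
struct Vec4 { float x, y, z, w; };   // 16 bytes

struct ClothVertex {
    Vec4  pos;      // p^t      (16 bytes)
    Vec4  prevPos;  // p^(t-1)  (16 bytes)
    Vec4  force;    // f^t      (16 bytes)
    float mass;     // m        (4 bytes)
};
static_assert(sizeof(ClothVertex) == 52, "52 bytes per vertex");

// Number of whole vertices that fit in a local-store budget.
constexpr std::size_t maxVertices(std::size_t budgetBytes) {
    return budgetBytes / sizeof(ClothVertex);
}

// One Verlet step: p^(t+1) = 2 p^t - p^(t-1) + (f^t / m) dt^2.
inline void integrate(ClothVertex& v, float dt) {
    assert(v.mass > 0.0f);
    const float k = dt * dt / v.mass;
    const Vec4 next = {
        2.0f * v.pos.x - v.prevPos.x + v.force.x * k,
        2.0f * v.pos.y - v.prevPos.y + v.force.y * k,
        2.0f * v.pos.z - v.prevPos.z + v.force.z * k,
        v.pos.w
    };
    v.prevPos = v.pos;   // the current position becomes the previous one
    v.pos = next;
}
```

Each vertex is read and written once with no neighbor lookups, which is what makes this step a good fit for block-by-block streaming.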
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Streaming Integration 



[Animation: the vertex arrays in main memory are split into blocks B0 to B3. Blocks are brought into the local store one at a time (DMA_IN), processed, and written back (DMA_OUT).]
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Double-buffering 


© Take advantage of concurrent DMA and 
processing to hide transfer times 


Without double-buffering:

DMA_IN B0 | Process B0 | DMA_OUT B0 | DMA_IN B1 | Process B1 | DMA_OUT B1

With double-buffering (transfers overlap with processing):

Processing:  Process B0 | Process B1 | Process B2
DMA:         DMA_IN B0, DMA_IN B1, DMA_OUT B0, DMA_IN B2, DMA_OUT B1, DMA_IN B3
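
A minimal, portable sketch of the buffer rotation behind double-buffering. The functions `dmaIn`, `dmaOut`, and `processBlock` are hypothetical stand-ins: here `memcpy` plays the role of DMA, so the code shows which buffer is fetched, processed, and written back at each step, but not the real overlap in time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

const std::size_t kBlockElems = 4;   // tiny block size, for illustration only

// Stand-ins for DMA. Real DMA is asynchronous; memcpy is not, so this
// sketch shows the buffer rotation but not the actual overlap in time.
void dmaIn(float* ls, const float* mm)  { std::memcpy(ls, mm, kBlockElems * sizeof(float)); }
void dmaOut(float* mm, const float* ls) { std::memcpy(mm, ls, kBlockElems * sizeof(float)); }

// Placeholder for the per-block work (e.g. integration).
void processBlock(float* block) {
    for (std::size_t i = 0; i < kBlockElems; ++i) block[i] *= 2.0f;
}

void streamDoubleBuffered(std::vector<float>& mainMem) {
    assert(mainMem.size() % kBlockElems == 0);
    const std::size_t numBlocks = mainMem.size() / kBlockElems;
    if (numBlocks == 0) return;

    float ls[2][kBlockElems];          // two local-store buffers
    int cur = 0;
    dmaIn(ls[cur], &mainMem[0]);       // prime the pipeline with B0

    for (std::size_t b = 0; b < numBlocks; ++b) {
        const int next = cur ^ 1;
        if (b + 1 < numBlocks)         // start fetching B(b+1) ...
            dmaIn(ls[next], &mainMem[(b + 1) * kBlockElems]);
        processBlock(ls[cur]);         // ... while B(b) is processed
        dmaOut(&mainMem[b * kBlockElems], ls[cur]);
        cur = next;
    }
}
```

With truly asynchronous transfers, the loop would also have to wait for the outgoing transfer of a buffer to finish before fetching new data into it.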
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Streaming Data 



© Streaming is possible when the data access 

pattern is simple and predictable (e.g. linear) 

® Number of verts processed per frame depends on 
processing speed and bandwidth but not LS size 

© Unfortunately, not every step in the cloth 
solver can be fully streamed 

® Fixing edge lengths requires random memory
access...
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Fixing Edge Lengths 


© Points coming out of the integration step don’t 
necessarily satisfy edge distance constraints 


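struct Edge
{
    int v1;          // index of the first vertex
    int v2;          // index of the second vertex
    float restLen;   // rest length of the edge
};

Vector3 d = p[v2] - p[v1];
float len = sqrt(dot(d, d));
float diff = (len - restLen) / len;
p[v1] += d * 0.5 * diff;   // a stretched edge pulls both ends together
p[v2] -= d * 0.5 * diff;

[Figure: edge between p[v1] and p[v2]]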
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Fixing Edge Lengths 

® An iterative process: Fix one edge at a time 
by adjusting 2 vertex positions 

® Requires random access to particle positions 
array 

® Solution: 

@ Keep all particle positions in LS 
@ Stream in edge data 

@ In 200K we can fit 200KB / 16B > 12K vertices 
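
A sketch of one relaxation pass under this layout, assuming positions stay resident while edges are visited in blocks. The names `Vec3`, `fixEdge`, `fixEdges`, and `edgesPerBlock` are illustrative; the block loop only marks where a DMA of edge data would go. At 16 bytes per position, 200 * 1024 / 16 = 12800 positions fit in 200K.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
struct Edge { int v1, v2; float restLen; };

// Move both endpoints of one edge so that its length matches restLen.
inline void fixEdge(std::vector<Vec3>& p, const Edge& e) {
    Vec3& a = p[e.v1];
    Vec3& b = p[e.v2];
    const float dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
    const float len = std::sqrt(dx * dx + dy * dy + dz * dz);
    if (len == 0.0f) return;                        // coincident points: direction undefined
    const float s = 0.5f * (len - e.restLen) / len;
    a.x += dx * s;  a.y += dy * s;  a.z += dz * s;  // each end takes half the correction
    b.x -= dx * s;  b.y -= dy * s;  b.z -= dz * s;
}

// One relaxation pass. Positions stay resident; edges arrive block by block.
void fixEdges(std::vector<Vec3>& positions, const std::vector<Edge>& edges,
              std::size_t edgesPerBlock) {
    assert(edgesPerBlock > 0);
    for (std::size_t start = 0; start < edges.size(); start += edgesPerBlock) {
        std::size_t end = start + edgesPerBlock;
        if (end > edges.size()) end = edges.size();
        for (std::size_t i = start; i < end; ++i)   // block [start, end) would be DMA'd in
            fixEdge(positions, edges[i]);
    }
}
```

Because every edge reads positions written by earlier edges, the pass is sequential over edges even though the edge data itself streams linearly.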
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Rigid Bodies 



Our group is currently porting the AGEIA™ 
PhysX™ SDK to CELL 

Large codebase written with a PC 
architecture in mind 

@ Assumes easy random access to memory 
@ Processes tasks sequentially (no parallelism) 

Interesting example of how to port existing
code to a multi-core architecture





Starting the Port 

® Determine all the stages of the rigid body 
pipeline 

® Look for stages that are good candidates for 
parallelizing/optimizing 

® Profile code to make sure we are focusing on 
the right parts 
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Rigid Body Pipeline 


[Pipeline diagram; surviving labels: current body positions, constraint equations.]
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Profiling Scenario 
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Profiling Results 

[Chart: cumulative frame time]
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Running on the SPUs 



© Three steps: 

1. (PPU) Pre-process

® “Gather” operation (extract data from PhysX data
structures and pack it in main memory, MM)

2. (SPU) Execute

@ DMA packed data from MM to LS
@ Process data and store output in LS
@ DMA output to MM

3. (PPU) Post-process

® “Scatter” operation (unpack output data and put it back in
PhysX data structures)
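
The three steps can be sketched as follows. `Body` and `PackedBody` are hypothetical types invented for illustration (they are not PhysX structures), and `execute` is an ordinary function standing in for the SPU job; the point is only that the middle step touches nothing but the packed batch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical engine-side object: not laid out for DMA.
struct Body {
    float position[3];
    float velocity[3];
    void* userData;      // plus whatever else the engine keeps per body
};

// Hypothetical packed record: only what the job needs, padded to 16 bytes.
struct PackedBody {
    float position[4];
    float velocity[4];
};

// 1. (PPU) Gather: copy the needed fields into a contiguous batch.
std::vector<PackedBody> gather(const std::vector<Body*>& bodies) {
    std::vector<PackedBody> batch(bodies.size());   // value-initialized to zero
    for (std::size_t i = 0; i < bodies.size(); ++i)
        for (int k = 0; k < 3; ++k) {
            batch[i].position[k] = bodies[i]->position[k];
            batch[i].velocity[k] = bodies[i]->velocity[k];
        }
    return batch;
}

// 2. (SPU) Execute: works on the packed batch only, never on Body.
void execute(std::vector<PackedBody>& batch, float dt) {
    for (std::size_t i = 0; i < batch.size(); ++i)
        for (int k = 0; k < 3; ++k)
            batch[i].position[k] += batch[i].velocity[k] * dt;
}

// 3. (PPU) Scatter: write the results back into the engine objects.
void scatter(const std::vector<PackedBody>& batch, const std::vector<Body*>& bodies) {
    assert(batch.size() == bodies.size());
    for (std::size_t i = 0; i < batch.size(); ++i)
        for (int k = 0; k < 3; ++k)
            bodies[i]->position[k] = batch[i].position[k];
}
```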
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Why Involve the PPU? 


® Required PhysX data is not conveniently packed 

® Data is often not aligned 

® We need to use PhysX data structures to avoid 
breaking features we haven’t ported 

® Solutions: 

@ Use list DMAs to bring in data 
@ Modify existing code to force alignment 
@ Change PhysX code to work with new data structures 
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Batching Up Work 


Create work batches for each task

[Diagram: PPU Pre-Process -> SPU Execute -> PPU Post-Process. The stages exchange data through work batch buffers in MM; the PhysX data structures sit at both ends.]
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Narrow-phase Collision Detection 



© Problem: 

® A list of object pairs that may be colliding 
® Want to do contact processing on SPUs 
® Pairs list has references to geometry 
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Narrow-phase Collision Detection 

© Data locality 

® Same bodies may be in several pairs 
® Geometry may be instanced for different bodies 

© SPU memory access 

® Can only access main memory with DMA 
® No hardware cache 
® Data reuse must be explicit 
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Software Cache 


Idea: make a (read-only) software cache 
® Cache entry is one geometric object 
® Entries have variable size 

Basic operation 
® SPU checks cache for object 
® If not in cache, object fetched with DMA 
® Cache returns a local address for object 
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Software Cache 


Data Structures 

® Two entry buffers 

® New entries appended to “current” buffer 
@ Hash-table used to record and find loaded entries 


[Diagram: Buffer 0 and Buffer 1 holding entries A, B, and C, with a marker showing where the next DMA lands.]
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Software Cache 


Data Replacement 

® When space runs out in a buffer 

@ Overwrite data in second buffer 

® Considerations 

@ Does not fragment memory 
@ No searches for free space 
@ But does not prefer frequently used data 
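
A sketch of such a cache, under stated assumptions: entries are keyed by their main-memory address, `memcpy` stands in for the DMA fetch, and the class name and its members are hypothetical. It follows the scheme above: append to the current buffer, and when it fills up, switch to the other buffer and overwrite what it held.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <unordered_map>
#include <vector>

// Read-only software cache with two append-only buffers (a sketch).
class SoftwareCache {
public:
    explicit SoftwareCache(std::size_t bufferBytes)
        : size_(bufferBytes), cur_(0), hits_(0), misses_(0) {
        for (int b = 0; b < 2; ++b) { buf_[b].resize(size_); used_[b] = 0; }
    }

    // Returns a local copy of the object at src. Pointers returned earlier
    // may be overwritten once their buffer is recycled.
    const void* get(const void* src, std::size_t bytes) {
        assert(bytes > 0 && bytes <= size_);
        auto it = table_.find(src);
        if (it != table_.end()) {
            ++hits_;
            return &buf_[it->second.buffer][it->second.offset];
        }
        if (used_[cur_] + bytes > size_) {   // current buffer is full:
            cur_ ^= 1;                       // switch to the other buffer
            recycle(cur_);                   // and overwrite what it held
        }
        Slot s = { cur_, used_[cur_] };
        std::memcpy(&buf_[cur_][s.offset], src, bytes);   // stand-in for a DMA get
        used_[cur_] += bytes;
        table_[src] = s;
        ++misses_;
        return &buf_[cur_][s.offset];
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    struct Slot { int buffer; std::size_t offset; };

    void recycle(int b) {                    // forget every entry stored in buffer b
        for (auto it = table_.begin(); it != table_.end(); ) {
            if (it->second.buffer == b) it = table_.erase(it);
            else ++it;
        }
        used_[b] = 0;
    }

    std::size_t size_;
    int cur_;
    std::size_t hits_, misses_;
    std::vector<unsigned char> buf_[2];
    std::size_t used_[2];
    std::unordered_map<const void*, Slot> table_;
};
```

Recycling a whole buffer at once is what avoids fragmentation and free-space searches, at the cost of sometimes discarding entries that are still in frequent use.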
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Software Cache 


Hiding the DMA latency 

® Double-buffering

@ Start DMA for un-cached entries
@ Process previously DMA’d entries

® Process/pre-fetch batches

@ Fetch and compute times vary
@ Batching may improve balance

® DMA-lists useful

@ One DMA command
@ Multiple chunks of data gathered

[Diagram: processing of the current buffer overlapped with DMA.]
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Software Caching 

® Conclusions 

® Simple cache is practical 

@ Used for small convex objects in PhysX 

® Design considerations 

® Tradeoff of cache-logic cycles vs. bandwidth saved 
@ Pre-fetching important to include 
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Single SPU Performance 


[Timeline, PPU only: the PPU runs Exec itself.]
[Timeline, PPU + SPU: the SPU runs Exec and the PPU is free during that time.]



SPU Exec < PPU Exec: SIMD + fast mem access 
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Multiple SPU Performance 




© Pre- and Post- processing times determine 
how many SPUs can be used effectively 
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Multiple SPU Performance 


[Timelines for the PPU together with 1 SPU, 2 SPUs, and more SPUs.]
PPU vs SPU comparisons

[Chart: Convex Stack (500 boxes), time in microseconds, for PPU-only, 1 SPU, 2 SPUs, 3 SPUs, and 4 SPUs.]
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Duck Demo 


© One of our first CELL demos (spring 2005) 
© Several interacting physics systems: 

@ Rigid bodies (ducks & boats) 

® Height-field water surface 
® Cloth with ripping (sails) 

® Particle based fluids (splashes + cups) 




Duck Demo (Lots of Ducks) 
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Duck Demo 



© Ambitious project with short deadline 
© Early PC prototypes of some pieces 
© Most straightforward way to parallelize: 

@ Dedicate one SPU for each subsystem 

© Each piece could be developed and tested 
individually 
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Duck Demo Resource Allocation 


@ PPU - main loop
® SPU thread synchronization, draw calls
@ SPU0 - height field water (<50%)
@ SPU1 - splashes iso-surface (<50%)
@ SPU2 - cloth sails for boat 1 (<50%)
@ SPU3 - cloth sails for boat 2 (<50%)
@ SPU4 - rigid body collision/response (95%)


[Timeline over 1 frame: HF water, iso-surface, cloth, cloth, and rigid bodies, one row per SPU.]
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Parallelization Recipe 



One three-step approach to code 
parallelization: 


1. Find independent components
2. Run them side-by-side
3. Recursively apply recipe to components
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Challenges 



Step 1: Find independent components 
© Where do you look? 

© Maybe you need to break apart and overlap 
your data? 

-> e.g. Broad phase collision detection 

© Maybe you need to break apart your loop into 
individual iterations? 

-> e.g. Solving cloth constraints 
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Broad Phase Collision Detection 


Need to test 600 rigid bodies against each other. 


600 Objects -> 200 Objects A, 200 Objects B, 200 Objects C

200 Objects A vs 200 Objects B
200 Objects A vs 200 Objects C
200 Objects B vs 200 Objects C

We can execute all three of these simultaneously
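
A sketch of how the split turns into independent jobs. The slide lists the three cross-group jobs; the sketch also emits each group against itself (A vs A and so on), which is an addition made here so that every pair among the 600 objects is covered exactly once. `PairJob`, `makeJobs`, and `pairTests` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One independent broad-phase job: test the objects of groupA against groupB.
struct PairJob { std::size_t groupA, groupB; };

// All group-vs-group jobs, including a group against itself.
std::vector<PairJob> makeJobs(std::size_t numGroups) {
    std::vector<PairJob> jobs;
    for (std::size_t a = 0; a < numGroups; ++a)
        for (std::size_t b = a; b < numGroups; ++b) {
            PairJob job = { a, b };
            jobs.push_back(job);
        }
    return jobs;
}

// Number of object-pair tests inside one job, for equally sized groups.
std::size_t pairTests(const PairJob& job, std::size_t groupSize) {
    assert(groupSize > 0);
    return job.groupA == job.groupB ? groupSize * (groupSize - 1) / 2
                                    : groupSize * groupSize;
}
```

For three groups of 200 this yields six jobs: 3 * 200 * 200 = 120000 cross-group tests plus 3 * 19900 = 59700 within-group tests, which adds up to the 600 * 599 / 2 = 179700 pairs of the unsplit problem. The jobs only read object data, so they can run at the same time.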
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Cloth Solving 


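// Serial: one solver call per iteration
for (i = 1; i <= 5; i++) {
    cloth = solve(cloth);
}

// Parallel: a and b run concurrently; c runs only after both finish
for (i = 1; i <= 5; i++) {
    solve_on_proc1(a);
    solve_on_proc2(b);
    wait_for_all();
    solve_on_proc1(c);
    wait_for_all();
}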
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...challenges 


Step 2: Run them side-by-side 
© Bandwidth and cache issues 

-> Need good data layout to avoid thrashing cache 
or bus 

© Processor issues 

-> Need efficient processor management scheme 

© What if the job sizes are very different? 

e.g. a suit of cloth and a separate neck tie 

-> Need further refinement of large jobs, or you only 
save on the small neck tie time 
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...challenges 



Step 3: Recurse
© When do you stop?

-> Overhead of launching smaller jobs
-> Synchronization when a stage is done

e.g. Gather results from all collision detection before
solving

© But this can go down to the instruction level

e.g. Using Structure-of-Arrays, transform four
independent vectors at once
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High Level Parallelization: 

Duck Demo 
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Fluid Simulation 

Fluid Surface
Rigid Bodies
Cloth Sails

[Diagram: the four subsystems above shown as parallel tasks.]

Dependency exists: Fluid Simulation -> Fluid Surface

But cloth was for multiple boats

Note that the parts didn’t take an 
equal amount of time to run. We 
could have done better given time! 
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Lower Level Parallelization 
Rigid Body Simulation 






[Diagram: Broad Phase Collision Detection split across Proc 1, Proc 2, and Proc 3 by object groups A, B, and C, followed by Narrow Phase Collision Detection split across Proc 1, Proc 2, and Proc 3.]
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Constraint Solving 


[Diagram: constraint solving split across Proc 1, Proc 2, and Proc 3.]
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Structure of Arrays 


Array of Structures
or “AoS”

Data[0], Data[1], Data[2], Data[3], Data[4], Data[5], Data[6], Data[7]

Structure of Arrays
or “SoA”

X0 X1 X2 X3   \
Y0 Y1 Y2 Y3    > 1 SoA Vector
Z0 Z1 Z2 Z3    >
W0 W1 W2 W3   /
X4 X5 X6 X7
Y4 Y5 Y6 Y7
Z4 Z5 Z6 Z7
W4 W5 W6 W7

Since W is almost always 0 or 1, we can eliminate it with a
clever math library and save 25% memory and bandwidth!
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Lowest Level Parallelization: 

Structure-of-Array processing of Particles
Given:

$P_n(t)$ = position of particle n at time t
$V_n(t)$ = velocity of particle n at time t

$P_1(t_i) = P_1(t_{i-1}) + V_1(t_{i-1}) \cdot dt + 0.5 \cdot G \cdot dt^2$
$P_2(t_i) = P_2(t_{i-1}) + V_2(t_{i-1}) \cdot dt + 0.5 \cdot G \cdot dt^2$

Note they are independent of each other
So we can run four together using SoA

$P_{\{1..4\}}(t_i) = P_{\{1..4\}}(t_{i-1}) + V_{\{1..4\}}(t_{i-1}) \cdot dt + 0.5 \cdot G \cdot dt^2$
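
A portable sketch of the four-at-once update. `SoA4` and `integrate4` are illustrative names; the loop over `lane` stands in for a single SIMD instruction per component, and the W component is dropped as suggested on the Structure of Arrays slide.

```cpp
#include <cassert>

// Four particles in SoA layout (W dropped); names are illustrative.
struct SoA4 { float x[4], y[4], z[4]; };

// P(t_i) = P(t_{i-1}) + V(t_{i-1}) * dt + 0.5 * G * dt^2, four particles at once.
void integrate4(SoA4& p, const SoA4& v, const float g[3], float dt) {
    assert(dt >= 0.0f);
    const float h = 0.5f * dt * dt;
    for (int lane = 0; lane < 4; ++lane) {   // each line maps to one SIMD operation
        p.x[lane] += v.x[lane] * dt + g[0] * h;
        p.y[lane] += v.y[lane] * dt + g[1] * h;
        p.z[lane] += v.z[lane] * dt + g[2] * h;
    }
}
```

Every lane does the same arithmetic on a different particle, so no shuffling between lanes is needed.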
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Failure Case 

Gauss Seidel Solver 




Consider a simple position-based solver that 
uses distance constraints. Given: 

p = current positions of all objects

solve(C_n, p) takes p and constraint C_n and computes a new p
that satisfies C_n

p = solve(C_0, p)
p = solve(C_1, p)

Note that to solve C_1, we need the result of C_0.
Can’t solve C_0 and C_1 concurrently!
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Failure Case 

Possible Solutions 

Generally, you’re out of luck, but...

© Some cases have very limited dependencies

e.g. particle-based cloth solving

-> Solution: Arrange constraints such that no four
adjacent constraints share cloth particles

© Consider a different solver

e.g. Jacobi solvers don’t use updated values until all
constraints have been processed once

@ But they need more memory (p_next and p_current)

@ And may need more iterations to converge
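
The difference between the two solver styles can be sketched on a one-dimensional chain of distance constraints. This is an illustration only, not the solver from the talk: positions are plain floats, and the names `Constraint`, `correction`, `gaussSeidelPass`, and `jacobiPass` are made up for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Constraint { int a, b; float rest; };   // keep |x[b] - x[a]| equal to rest

// Half of the signed length error of one constraint.
inline float correction(const std::vector<float>& x, const Constraint& c) {
    const float d = x[c.b] - x[c.a];
    const float len = std::fabs(d);
    if (len == 0.0f) return 0.0f;
    return 0.5f * d * (len - c.rest) / len;
}

// Gauss-Seidel: each constraint sees the positions written by the previous one.
void gaussSeidelPass(std::vector<float>& x, const std::vector<Constraint>& cs) {
    for (std::size_t i = 0; i < cs.size(); ++i) {
        const float c = correction(x, cs[i]);
        x[cs[i].a] += c;
        x[cs[i].b] -= c;
    }
}

// Jacobi: every constraint reads the old positions, so the loop body has no
// dependency between iterations; the price is a second position array.
void jacobiPass(std::vector<float>& x, const std::vector<Constraint>& cs) {
    std::vector<float> next(x);                 // extra memory for the new positions
    for (std::size_t i = 0; i < cs.size(); ++i) {
        const float c = correction(x, cs[i]);   // reads old positions only
        next[cs[i].a] += c;
        next[cs[i].b] -= c;
    }
    x.swap(next);
}
```

Summing the corrections per vertex is the simplest choice; averaging them is a common variation when many constraints share a vertex. For the chain 0, 2, 4 with rest length 1, one Gauss-Seidel pass moves the middle point to 2.25, while one Jacobi pass leaves it at 2.0 because its two corrections cancel.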




Duck Demo (EyeToy + SPH) 
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Smoothed Particle Hydrodynamics 

(SPH) Fluid Simulation 
© Smoothed-particles 

® Mass distributed around a point 
® Density falls to 0 at a radius h 



© Forces between particles closer than 2h 
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SPH Fluid Simulation 


© High-level parallelism 


@ Put particles in grid cells 
@ Process on different SPUs 
® (Not used in duck demo) 

© Low-level parallelism 

@ SIMD and dual-issue on SPU 
@ Large n per cell may be better 



@ Less grid overhead 
@ Loops fast on SPU 
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SPH Loop 



© Consider two sets of particles P and Q 
@ E.g., taken from neighbor grid cells 
@ O(n^2) problem

© Can unroll (e.g., by 4)

for (i = 0; i < numP; i++)
    for (j = 0; j < numQ; j += 4) {
        Compute force (p[i], q[j])
        Compute force (p[i], q[j+1])
        Compute force (p[i], q[j+2])
        Compute force (p[i], q[j+3])
    }
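
A runnable sketch of the unrolled pair loop. To stay short it counts interacting pairs (closer than 2h) instead of computing forces, and it adds a tail loop for the case where the size of Q is not a multiple of 4, which the slide does not show. The names `P3`, `interacts`, and `countInteractions` are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct P3 { float x, y, z; };

// True when two particles are closer than 2h, i.e. when they exert forces.
inline bool interacts(const P3& a, const P3& b, float h) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz < 4.0f * h * h;
}

// O(|P| * |Q|) pair loop with the inner loop unrolled by 4.
std::size_t countInteractions(const std::vector<P3>& P, const std::vector<P3>& Q, float h) {
    assert(h > 0.0f);
    std::size_t count = 0;
    const std::size_t n4 = Q.size() - Q.size() % 4;   // largest multiple of 4
    for (std::size_t i = 0; i < P.size(); ++i) {
        std::size_t j = 0;
        for (; j < n4; j += 4) {                      // four independent pair tests
            count += interacts(P[i], Q[j],     h);
            count += interacts(P[i], Q[j + 1], h);
            count += interacts(P[i], Q[j + 2], h);
            count += interacts(P[i], Q[j + 3], h);
        }
        for (; j < Q.size(); ++j)                     // leftover particles
            count += interacts(P[i], Q[j], h);
    }
    return count;
}
```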




SPH Loop, SoA 



© Idea: 

@ Increase SIMD throughput with structure-of-arrays 
@ Transpose and produce combinations 


[Diagram: p_i combined with q_j, q_{j+1}, q_{j+2}, and q_{j+3}.]
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SPH Loop, Software Pipelined 


© Add software pipelining 

@ Conversion instructions can dual-issue with math 


Without pipelining, each iteration runs in sequence:

Load[i], To SoA[i] -> Compute[i] -> From SoA[i], Store[i]

With software pipelining:

Pipe 0: Compute[i]
Pipe 1: From SoA[i-1], Store[i-1], Load[i+1], To SoA[i+1]
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Recap 


Finding independence is hard! 

® Across subsystems or within subsystems? 

® Across iterations or within iterations? 

® Data level independence? 

® Instruction level independence? 

® How about “bandwidth level” independence? 

Parallelization overhead 

® Sometimes running serially wins over overhead of 
parallelization 
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Particle Simulation Demo 





Questions? 



http://www.research.scea.com/ 


Contacts: 

Vangelis Kokkevis: vangelis_kokkevis@playstation.sony.com 
Eric Larsen: eric_larsen@playstation.sony.com 
Steven Osman: steven_osman@playstation.sony.com 
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